This material is from Professor 최규빈's Special Topics in Big Data Analysis course at Jeonbuk National University, Fall 2023.

02wk-007: Titanic, AutoGluon (Fsize, Drop)

최규빈
2023-09-12

1. Lecture video

https://youtu.be/playlist?list=PLQqh36zP38-wiSZXhNO5rMncu6h42SNDi&si=fmqkO_EQek1SgbNQ

2. Import

# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/titanic/train.csv
/kaggle/input/titanic/test.csv
/kaggle/input/titanic/gender_submission.csv

#pip install autogluon

from autogluon.tabular import TabularDataset, TabularPredictor

3. Analysis procedure

A. Data

- Analogy: this step is like receiving the problems to be solved.

tr = TabularDataset("~/Desktop/titanic/train.csv")
tst = TabularDataset("~/Desktop/titanic/test.csv")

- Feature engineering: add Fsize (SibSp + Parch) and drop the two source columns.

_tr = tr.eval('Fsize = SibSp + Parch').drop(['SibSp','Parch'],axis=1)
_tst = tst.eval('Fsize = SibSp + Parch').drop(['SibSp','Parch'],axis=1)

B. Creating the predictor

- Analogy: this step is like creating the student who will solve the problems.

predictr = TabularPredictor("Survived")

No path specified. Models will be saved in: "AutogluonModels/ag-20230917_141245/"

C. Fitting (fit)

- Analogy: this step is like the student studying.

- Training

predictr.fit(_tr) # give the student (predictr) the problems (_tr) and have it study (predictr.fit())

Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20230917_141245/"
AutoGluon Version:  0.8.2
Python Version:     3.8.18
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #26~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Jul 13 16:27:29 UTC 2
Disk Space Avail:   775.53 GB / 982.82 GB (78.9%)
Train Data Rows:    891
Train Data Columns: 10
Label Column: Survived
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
    2 unique label values:  [0, 1]
    If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping:  class 1 = 1, class 0 = 0
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
    Available Memory:                    38785.27 MB
    Train Data (Original)  Memory Usage: 0.31 MB (0.0% of available memory)
    Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
    Stage 1 Generators:
        Fitting AsTypeFeatureGenerator...
            Note: Converting 1 features to boolean dtype as they only contain 2 unique values.
    Stage 2 Generators:
        Fitting FillNaFeatureGenerator...
    Stage 3 Generators:
        Fitting IdentityFeatureGenerator...
        Fitting CategoryFeatureGenerator...
            Fitting CategoryMemoryMinimizeFeatureGenerator...
        Fitting TextSpecialFeatureGenerator...
            Fitting BinnedFeatureGenerator...
            Fitting DropDuplicatesFeatureGenerator...
        Fitting TextNgramFeatureGenerator...
            Fitting CountVectorizer for text features: ['Name']
            CountVectorizer fit with vocabulary size = 8
    Stage 4 Generators:
        Fitting DropUniqueFeatureGenerator...
    Stage 5 Generators:
        Fitting DropDuplicatesFeatureGenerator...
    Types of features in original data (raw dtype, special dtypes):
        ('float', [])        : 2 | ['Age', 'Fare']
        ('int', [])          : 3 | ['PassengerId', 'Pclass', 'Fsize']
        ('object', [])       : 4 | ['Sex', 'Ticket', 'Cabin', 'Embarked']
        ('object', ['text']) : 1 | ['Name']
    Types of features in processed data (raw dtype, special dtypes):
        ('category', [])                    : 3 | ['Ticket', 'Cabin', 'Embarked']
        ('float', [])                       : 2 | ['Age', 'Fare']
        ('int', [])                         : 3 | ['PassengerId', 'Pclass', 'Fsize']
        ('int', ['binned', 'text_special']) : 9 | ['Name.char_count', 'Name.word_count', 'Name.capital_ratio', 'Name.lower_ratio', 'Name.special_ratio', ...]
        ('int', ['bool'])                   : 1 | ['Sex']
        ('int', ['text_ngram'])             : 9 | ['__nlp__.henry', '__nlp__.john', '__nlp__.master', '__nlp__.miss', '__nlp__.mr', ...]
    0.2s = Fit runtime
    10 features in original data used to generate 27 features in processed data.
    Train Data (Processed) Memory Usage: 0.07 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.17s ...
AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'
    To change this, specify the eval_metric parameter of Predictor()
Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 712, Val Rows: 179
User-specified model hyperparameters to be fit:
{
    'NN_TORCH': {},
    'GBM': [{'extra_trees': True, 'ag_args': {'name_suffix': 'XT'}}, {}, 'GBMLarge'],
    'CAT': {},
    'XGB': {},
    'FASTAI': {},
    'RF': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
    'XT': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
    'KNN': [{'weights': 'uniform', 'ag_args': {'name_suffix': 'Unif'}}, {'weights': 'distance', 'ag_args': {'name_suffix': 'Dist'}}],
}
Fitting 13 L1 models ...
Fitting model: KNeighborsUnif ...
/home/coco/anaconda3/envs/py38/lib/python3.8/site-packages/torch/cuda/__init__.py:497: UserWarning: Can't initialize NVML
  warnings.warn("Can't initialize NVML")
Exception ignored on calling ctypes callback function: <function _ThreadpoolInfo._find_modules_with_dl_iterate_phdr.<locals>.match_module_callback at 0x7f2dd48085e0>
Traceback (most recent call last):
  File "/home/coco/anaconda3/envs/py38/lib/python3.8/site-packages/threadpoolctl.py", line 400, in match_module_callback
    self._make_module_from_path(filepath)
  File "/home/coco/anaconda3/envs/py38/lib/python3.8/site-packages/threadpoolctl.py", line 515, in _make_module_from_path
    module = module_class(filepath, prefix, user_api, internal_api)
  File "/home/coco/anaconda3/envs/py38/lib/python3.8/site-packages/threadpoolctl.py", line 606, in __init__
    self.version = self.get_version()
  File "/home/coco/anaconda3/envs/py38/lib/python3.8/site-packages/threadpoolctl.py", line 646, in get_version
    config = get_config().split()
AttributeError: 'NoneType' object has no attribute 'split'
    0.6536   = Validation score   (accuracy)
    0.57s    = Training   runtime
    0.03s    = Validation runtime
Fitting model: KNeighborsDist ...
Exception ignored on calling ctypes callback function: <function _ThreadpoolInfo._find_modules_with_dl_iterate_phdr.<locals>.match_module_callback at 0x7f2dd230f0d0>
Traceback (most recent call last):
  File "/home/coco/anaconda3/envs/py38/lib/python3.8/site-packages/threadpoolctl.py", line 400, in match_module_callback
    self._make_module_from_path(filepath)
  File "/home/coco/anaconda3/envs/py38/lib/python3.8/site-packages/threadpoolctl.py", line 515, in _make_module_from_path
    module = module_class(filepath, prefix, user_api, internal_api)
  File "/home/coco/anaconda3/envs/py38/lib/python3.8/site-packages/threadpoolctl.py", line 606, in __init__
    self.version = self.get_version()
  File "/home/coco/anaconda3/envs/py38/lib/python3.8/site-packages/threadpoolctl.py", line 646, in get_version
    config = get_config().split()
AttributeError: 'NoneType' object has no attribute 'split'
    0.6536   = Validation score   (accuracy)
    0.02s    = Training   runtime
    0.01s    = Validation runtime
Fitting model: LightGBMXT ...
    0.8101   = Validation score   (accuracy)
    0.21s    = Training   runtime
    0.0s     = Validation runtime
Fitting model: LightGBM ...
    0.8268   = Validation score   (accuracy)
    0.21s    = Training   runtime
    0.0s     = Validation runtime
Fitting model: RandomForestGini ...
    0.8156   = Validation score   (accuracy)
    0.29s    = Training   runtime
    0.03s    = Validation runtime
Fitting model: RandomForestEntr ...
    0.8212   = Validation score   (accuracy)
    0.27s    = Training   runtime
    0.02s    = Validation runtime
Fitting model: CatBoost ...
    0.8268   = Validation score   (accuracy)
    0.56s    = Training   runtime
    0.0s     = Validation runtime
Fitting model: ExtraTreesGini ...
    0.8045   = Validation score   (accuracy)
    0.27s    = Training   runtime
    0.02s    = Validation runtime
Fitting model: ExtraTreesEntr ...
    0.7989   = Validation score   (accuracy)
    0.29s    = Training   runtime
    0.03s    = Validation runtime
Fitting model: NeuralNetFastAI ...
No improvement since epoch 9: early stopping
    0.8268   = Validation score   (accuracy)
    0.62s    = Training   runtime
    0.01s    = Validation runtime
Fitting model: XGBoost ...
    0.8212   = Validation score   (accuracy)
    0.2s     = Training   runtime
    0.0s     = Validation runtime
Fitting model: NeuralNetTorch ...
    0.838    = Validation score   (accuracy)
    1.67s    = Training   runtime
    0.01s    = Validation runtime
Fitting model: LightGBMLarge ...
    0.8268   = Validation score   (accuracy)
    0.34s    = Training   runtime
    0.0s     = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
    0.8603   = Validation score   (accuracy)
    0.33s    = Training   runtime
    0.0s     = Validation runtime
AutoGluon training complete, total runtime = 6.31s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20230917_141245/")

<autogluon.tabular.predictor.predictor.TabularPredictor at 0x7f2da6b8c490>

- Check the leaderboard (grading the mock exam)

predictr.leaderboard()

                  model  score_val  pred_time_val  fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0   WeightedEnsemble_L2   0.860335       0.039553  2.707510                0.000509           0.333600            2       True         14
1        NeuralNetTorch   0.837989       0.007419  1.666336                0.007419           1.666336            1       True         12
2         LightGBMLarge   0.826816       0.002828  0.336571                0.002828           0.336571            1       True         13
3              LightGBM   0.826816       0.003033  0.210566                0.003033           0.210566            1       True          4
4              CatBoost   0.826816       0.003410  0.561348                0.003410           0.561348            1       True          7
5       NeuralNetFastAI   0.826816       0.006794  0.623764                0.006794           0.623764            1       True         10
6               XGBoost   0.821229       0.004909  0.202243                0.004909           0.202243            1       True         11
7      RandomForestEntr   0.821229       0.022999  0.271596                0.022999           0.271596            1       True          6
8      RandomForestGini   0.815642       0.025080  0.294840                0.025080           0.294840            1       True          5
9            LightGBMXT   0.810056       0.002956  0.207451                0.002956           0.207451            1       True          3
10       ExtraTreesGini   0.804469       0.022679  0.268605                0.022679           0.268605            1       True          8
11       ExtraTreesEntr   0.798883       0.025636  0.289557                0.025636           0.289557            1       True          9
12       KNeighborsDist   0.653631       0.013097  0.015496                0.013097           0.015496            1       True          2
13       KNeighborsUnif   0.653631       0.033604  0.567649                0.033604           0.567649            1       True          1

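- Meaning of the validation set: according to the log above, AutoGluon held out about 20% of the training rows (179 of 891) and fit the models on the remaining 712. score_val is the accuracy on those held-out rows, which is why it plays the role of a mock exam.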
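The training log above reports `holdout_frac=0.2, Train Rows: 712, Val Rows: 179`. The following is a minimal sketch of such a holdout split, not AutoGluon's own code: the function name `holdout_split` is hypothetical, the rounding rule is an assumption chosen because it reproduces the numbers in the log, and AutoGluon's actual split may differ in other ways (for example, by stratifying on the label).

```python
import math
import random

def holdout_split(n_rows, holdout_frac=0.2, seed=0):
    """Split row indices 0..n_rows-1 into (train, validation) index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Assumption: the validation size is rounded up, which reproduces
    # the 712 / 179 split reported in the log for 891 rows.
    n_val = math.ceil(n_rows * holdout_frac)
    return indices[n_val:], indices[:n_val]

train_idx, val_idx = holdout_split(891)
print(len(train_idx), len(val_idx))  # 712 179
```

The models never see the validation rows during fitting, so the score on them estimates how well each model handles unseen passengers.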

D. Prediction (predict)

- Analogy: this step is like solving problems after studying.

- Solve the training set (predict) \(\to\) check the score

(tr.Survived == predictr.predict(_tr)).mean()

0.9438832772166106
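The expression above compares the true labels with the predicted labels element by element and averages the resulting booleans, which is the accuracy. A plain-Python sketch of the same computation, with made-up labels and a hypothetical helper name `accuracy`:

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = [t == p for t, p in zip(y_true, y_pred)]
    return sum(matches) / len(matches)

y_true = [0, 1, 1, 0, 1]  # made-up true labels
y_pred = [0, 1, 0, 0, 1]  # made-up predictions (one mistake)
print(accuracy(y_true, y_pred))  # 0.8
```

The training-set score of about 0.944 is higher than the best validation score of 0.8603 on the leaderboard. That is expected, because the 891 training rows include the 712 rows the models were fit on, so this number is an optimistic estimate.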

- Solve the test set (predict) \(\to\) submit the results to Kaggle to check the score

tst.assign(Survived = predictr.predict(_tst)).loc[:,['PassengerId','Survived']]\
.to_csv("autogluon(Fsize,Drop)_submission.csv",index=False)
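To see what the submission file contains, here is a small sketch that runs the same assign / loc / to_csv chain on made-up data and writes to an in-memory buffer instead of a file. `toy_tst` and `toy_pred` are hypothetical stand-ins for tst and predictr.predict(_tst).

```python
import io
import pandas as pd

# Hypothetical stand-ins for tst and predictr.predict(_tst); values are made up.
toy_tst = pd.DataFrame({"PassengerId": [892, 893, 894], "Pclass": [3, 3, 2]})
toy_pred = pd.Series([0, 1, 0], index=toy_tst.index)

# Attach the predictions as a Survived column, then keep only the two
# columns that go into the submission file.
submission = toy_tst.assign(Survived=toy_pred).loc[:, ["PassengerId", "Survived"]]

buffer = io.StringIO()
submission.to_csv(buffer, index=False)  # index=False keeps the row index out of the file
csv_text = buffer.getvalue()
print(csv_text)
```

The predictions are attached to tst rather than _tst; both have the same rows in the same order, so the predicted labels line up with the PassengerId values.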